The aim of faltwerk is to support spatial exploratory data analysis of protein structures. There are many use cases, but here we will focus on positively selected residues in the iron transporter transferrin in great apes, identified by Barber et al. (2014, Science, link). In this beautiful study, they trace this selection back to escape from iron piracy:
We show that the iron transport protein transferrin is engaged in ancient and ongoing evolutionary conflicts with TbpA, a transferrin surface receptor from bacteria.
Below we assume the state of knowledge Barber et al. had when they embarked on the study, which means we have just found that several sites in transferrin are positively selected. We will use faltwerk to see if there is any spatial component to this finding, and to generate hypotheses that we can then follow up on (in the lab, for example, as the authors did). In fact, we assume even less: while the structures of transferrin and the transferrin-TbpA complex had been resolved by the time of the study, we will not use them, but instead predict them de novo using AlphaFold2 (AF2).
Disclaimer: Remember that ALL analyses below are based on the protein sequence ONLY, which continues to amaze me. One therefore has to be careful not to place too much confidence in any individual technique, and to seek orthogonal evidence for any finding that might come up.
import json
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
import altair as alt
import numpy as np
import py3Dmol
import pandas as pd
from libpysal.weights import KNN, DistanceBand
from spreg import OLS, Probit
# https://github.com/pysal/spreg/blob/d464bbbc3c8601f1ca1989f4756967dca3a83179/spreg/probit.py#L704
from faltwerk.biochem import solvent_access
from faltwerk.geometry import is_close, get_complex_interface, distance_to_positions, get_alpha_carbon_atoms
from faltwerk.io import load_bfactor_column, load_conserved
from faltwerk.models import Fold, Complex, Binding, AlphaFold
from faltwerk.stats import find_hotspots, cluster
from faltwerk.vis import Layout, plot_alphafold
from faltwerk.utils import flatten
For the prediction of the transferrin protein structure we used AF2 as implemented in "ColabFold":
By default, AF2 predicts five models, and we can choose the best according to some metric, where "pLDDT" is most commonly used. Values above about 70% indicate that the model is quite confident about the prediction. A superposition of the models can give an idea of the variance in the prediction. Behind the scenes, faltwerk takes the prediction results, renames the chains, and aligns the structures (the latter using foldseek).
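As an aside on where pLDDT lives: AF2 writes the per-residue confidence into the B-factor column of its PDB output. A minimal fixed-width parse of the alpha-carbon records, shown here on a single made-up ATOM line (my own sketch, not faltwerk's parser):

```python
# pLDDT is stored in the B-factor column (columns 61-66) of AF2 PDB files.
def plddt_per_residue(pdb_text):
    return [float(line[60:66])
            for line in pdb_text.splitlines()
            if line.startswith('ATOM') and line[12:16].strip() == 'CA']

# A single made-up, standard-width ATOM record for illustration
record = 'ATOM      2  CA  MET A   1      11.000  12.000  13.000  1.00 42.63           C'
values = plddt_per_residue(record)  # -> [42.63]
```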
fp = 'data/alphafold2/transferrin/'
af = AlphaFold(fp)
Best model (pLDDT): test_08df6_unrelaxed_rank_1_model_3.pdb
Align remaining models to best and rename
Aligning test_08df6_unrelaxed_rank_2_model_4.pdb to test_08df6_unrelaxed_rank_1_model_3.pdb
Renaming chain A to B
Aligning test_08df6_unrelaxed_rank_3_model_5.pdb to test_08df6_unrelaxed_rank_1_model_3.pdb
Renaming chain A to C
Aligning test_08df6_unrelaxed_rank_4_model_2.pdb to test_08df6_unrelaxed_rank_1_model_3.pdb
Renaming chain A to D
Aligning test_08df6_unrelaxed_rank_5_model_1.pdb to test_08df6_unrelaxed_rank_1_model_3.pdb
Renaming chain A to E
plot_alphafold(af)
You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol
<py3Dmol.view at 0x176f61340>
We continue with the best model. To be more flexible in the visualisation of proteins we'll introduce the Layout object.
model = af.best
# What annotation tracks are present?
model.annotation.keys()
# dict_keys(['position', 'plddt'])
Layout(model, panel_size=(400, 300)).geom_ribbon('plddt', palette='rainbow_r').render().show()
# Blue is good
# Pick any color or color palette from matplotlib:
# https://matplotlib.org/stable/tutorials/colors/colormaps.html
The model seems uncertain about how to fold the N-terminus, but other than that, we seem to have a decent structure prediction.
Proteins are a diverse bunch of molecules, and individual residues can have very different properties based on their (relative) position in space. faltwerk implements some convenience functions to quickly annotate such features, e.g. "solvent access".
_ = model.annotate_('access', solvent_access(model))
Layout(model).geom_surface('access', palette='viridis').render().show()
We now begin our scientific inquiry. Using several methods, Barber et al. arrived at a set of positions in the protein that appear to be under positive selection. What caught their attention was the spatial organisation of these positions: they seemed to be constrained to two regions of the linear protein sequence. Spatial clustering suggests that some part of the protein is more "relevant" to a biological process than others, and the properties of this part can hint at which biological process is responsible. For example, positive selection across binding sites can be caused by an evolutionary arms race between two organisms, as is the case here (for details, please refer to the manuscript mentioned above).
original = [153, 253, 382, 434, 435, 436, 439, 558, 574, 575, 576, 591, 592, 593, 614, 617, 619, 625]
# -1 because positions are 1-based in the publication, but Python uses 0-based indexing
barber2014 = [i-1 for i in original]
# Turn positions into a vector with a 1 for each position under selection and a 0 otherwise
selection = [1 if i in barber2014 else 0 for i in range(len(model))]
We can approach the question "is there spatial signal in sites under positive selection" in two ways.
In (1) we try to find regions where a "patch" around a residue in our structure has some property, although by pure chance we would not expect such an aggregation in space to occur (random distribution). We supply a p-value and a maximum false discovery rate (FDR) to adjust for multiple testing. Below, we define the neighborhood of a residue as a sphere with a radius of eight Angstrom. We will use the Getis-Ord statistic to find "hotspots", and in a one-sided test we are only interested in hotspots with more selection than expected (if you were also interested in negative/purifying selection, you would run the two-sided test).
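To build intuition for the Getis-Ord statistic, here is a toy sketch (my own simplified implementation, not faltwerk's, and without the FDR adjustment): for each site, compare the sum of the property over its neighbors against what a random scattering would predict, and express the excess as a z-score.

```python
import numpy as np

# x: a binary "selection" vector; w: a binary neighborhood matrix
def getis_ord(x, w):
    n = len(x)
    xbar, s = x.mean(), x.std()
    wi = w.sum(axis=1)                # number of neighbors per site
    num = w @ x - xbar * wi           # observed neighbor sum minus expectation
    den = s * np.sqrt((n * (w * w).sum(axis=1) - wi ** 2) / (n - 1))
    return num / den                  # z-scores; large positive = hotspot

# Ten sites on a line; sites 1-3 are "selected", neighbors = adjacent sites
x = np.array([0, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=float)
w = np.array([[1.0 if abs(i - j) == 1 else 0.0 for j in range(10)]
              for i in range(10)])
z = getis_ord(x, w)
# The center of the run of selected sites scores highest
```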
For (2) we use point density to cluster points that are close in space. Here, we take all residues identified as part of a hotspot of positive selection and cluster them. Basically, this segments hotspots, which can be useful in automated analyses. Imagine that two biological processes act on your favourite protein, one on a binding site and the other on the active center. If you encounter this case (unknowingly) in an automated analysis, you want to get two clusters, on which you can perform subsequent computations individually. For clustering, faltwerk implements HDBSCAN and Markov chain clustering (MCL).
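To see why two spatially separate patches fall into two clusters, here is a deliberately naive stand-in for the density-based step (faltwerk uses HDBSCAN/MCL; this sketch does single-linkage merging with a distance cutoff on made-up 2D points):

```python
import numpy as np

def link_clusters(points, cutoff=2.0):
    # Start with every point in its own cluster, then merge any two
    # clusters that contain points closer than the cutoff
    labels = list(range(len(points)))
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if np.linalg.norm(points[i] - points[j]) <= cutoff:
                old, new = labels[j], labels[i]
                labels = [new if l == old else l for l in labels]
    return labels

# Two made-up patches: three points near the origin, two near (10, 10)
pts = np.array([[0., 0.], [1., 0.], [0., 1.], [10., 10.], [11., 10.]])
labels = link_clusters(pts)  # two distinct cluster labels
```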
p = 0.05
FDR = 0.05
# (1)
hotspots = find_hotspots(
    model,
    selection,
    method='getis_ord',
    angstrom=8,
    false_discovery_rate=FDR,
    test_two_sided=False)
# (2)
clusters = cluster(model, hotspots, min_cluster_size=5)
# Annotate model
model.annotate_many_({
    'selection': selection,
    'hotspots': hotspots,
    'clusters': clusters})
A note on visualisation. To build up a figure, we instantiate a Layout. We can then choose to only select() certain parts of the protein. Below, we only care about the alpha-carbon atoms of certain residues on chain A of the protein structure. We can then pass this selection to the "style" layers, which start with geom_ (ggplot2, anyone?). You can layer on as many selections and styles as you want. By default they are placed in the first panel.
# Visualise
ly = Layout(model, panel_size=(200, 200), grid=(1, 3), linked=True)
pos = ly.select(residues=barber2014, elements=['CA'], chain='A')
ly.geom_ribbon(color='#ffffff')
ly.geom_sphere(selection=pos, color='black')
ly.geom_surface('hotspots', palette='binary', panel=(0, 1))
ly.geom_surface('clusters', palette='Set2_r', panel=(0, 2))
ly.render().show()
We find two hotspots on the C-terminal "lobe" of the protein. While there are two positively selected positions on the N-terminal lobe, they don't seem to cluster. Are hotspots useful? Maybe. Imagine you only have a few species representatives to analyse. How likely are you to find all sites that natural selection acts on? Hotspots "color in" regions that are spatially plausible, and it may or may not make sense to base further calculations on them rather than on the original residues. Either way, faltwerk gives you the tools to quickly try both.
Proteins often bind stuff. Transferrin, for example, is an iron (Fe2+) shuttle. Let's visualise which residues bind iron; maybe there is some spatial relationship. Note that faltwerk here uses the method implemented by Kiefl et al. in anvi'o, based on work by Kobren and Singh (2018) called "InteracDome":
# TODO: adjust path to Pfam domains, see install instructions
pfam = 'path/to/pfam_v31/Pfam-A.hmm'
b = Binding(model, 'representable')
b.predict_binding_(pfam)
# b.domains
# b.interactions
binding = b.get_binding('PF00405.16', 'FE')
# Residues with a binding score above 0.5
fe = [i for i, j in enumerate(binding) if j > .5]
ly = Layout(model)
# select
pos = ly.select(residues=barber2014, elements=['CA'], chain='A')
fe_ = ly.select(residues=fe) # carbon atoms of chain A are selected by default
# style the layer cake
ly.geom_ribbon(color='#ffffff')
ly.geom_sphere(selection=pos, color='black')
ly.geom_ribbon(selection=fe_, color='red')
ly.render().show()
As expected, we find one Fe-binding site per lobe. Note how residues that are distant on the linear sequence fold into spatial proximity to "hold" the iron ion. Also, the sites under positive selection have no clear relationship to this center. Never mind; let's just add the distance of each residue to the closest binding-site residue as a regressor for later. faltwerk offers several functions that make this kind of "geometry" calculation as easy as:
model.annotate_('distance_to_fe', distance_to_positions(model, fe))
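Conceptually, this annotation computes, for every residue, the Euclidean distance to the closest residue in a set of target positions. A minimal numpy sketch with made-up alpha-carbon coordinates (my own illustration, not faltwerk's implementation):

```python
import numpy as np

def min_distance_to(coords, targets):
    # Pairwise distances between all residues and the target residues,
    # then take the minimum per residue
    d = np.linalg.norm(coords[:, None, :] - coords[targets][None, :, :], axis=-1)
    return d.min(axis=1)

coords = np.array([[0., 0., 0.], [3., 4., 0.], [6., 8., 0.]])
dist = min_distance_to(coords, [0])  # -> [0., 5., 10.]
```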
We want to add more features that could explain the selection pattern, so let's keep collecting. Transferrin is known to bind several proteins. Barber et al. used protein complex structures from crystallography experiments, but AF2 predicts complexes surprisingly well, so we will use a predicted transferrin-TbpA complex and add "distance to binding interface" as a feature for our statistical work later. In terms of the science, TbpA is the iron pirate referred to in the manuscript.
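What a "complex interface" boils down to can be sketched in a few lines (made-up alpha-carbon coordinates; this is my own toy illustration, not how get_complex_interface is implemented): residues of one chain that lie within a cutoff of any residue of the other chain.

```python
import numpy as np

def interface_residues(chain_a, chain_b, cutoff=10.0):
    # Minimum distance from each chain A residue to any chain B residue
    d = np.linalg.norm(chain_a[:, None, :] - chain_b[None, :, :], axis=-1)
    return np.where(d.min(axis=1) <= cutoff)[0]

a = np.array([[0., 0., 0.], [5., 0., 0.], [50., 0., 0.]])
b = np.array([[0., 8., 0.]])
idx = interface_residues(a, b)  # residues 0 and 1 are within 10 A of chain B
```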
fp = 'data/3V8X/complex/test_cacad_unrelaxed_rank_1_model_3.pdb'
cx = Complex(fp)
interface = get_complex_interface(cx, 10)
distance_to_interface = distance_to_positions(model, interface)
model.annotate_('distance_to_interface', distance_to_interface)
ly = Layout(cx)
B = ly.select(chain='B')
ly.geom_ribbon(selection=B).render().show()
<Chain id=B> <Chain id=C>
We can also add features that were calculated by other programs and prediction models, for example:
fp = 'data/transferrin_binding.pdb'
dmasif = list(load_bfactor_column(fp))
fp = 'data/conservation/isoforms.linsi.faa' # protein MSA in fasta format
conserved = load_conserved(fp)
model.annotate_many_({
'interface_prediction': dmasif,
'conserved': conserved
})
ly = Layout(model, grid=(1, 4), panel_size=(200, 200), linked=False)
# unlinked, so each panel can be rotated individually
pos = ly.select(residues=barber2014, elements=['CA'], chain='A')
fe_ = ly.select(residues=fe) # carbon atoms of chain A are selected by default
ly.geom_ribbon(color='#ffffff')
ly.geom_sphere(selection=pos, color='black')
# Pick color(map)s from matplotlib
# https://matplotlib.org/stable/tutorials/colors/colormaps.html
ly.geom_surface('hotspots', palette='binary', panel=(0, 1))
ly.geom_surface('distance_to_interface', panel=(0, 2))
ly.geom_surface('interface_prediction', panel=(0, 3))
ly.render().show()
We can see that residues close to the TbpA binding interface co-locate with our selection hotspots (panel 3/4), while there is poor correspondence with the (more general) interface prediction, which is based on multiple residue properties (panel 4/4).
Now we can start to check some of the ideas obtained from eyeballing the data. The model annotation can easily be turned into a pandas dataframe and explored using a wide range of visualisation tools, such as altair here.
df = pd.DataFrame.from_dict(flatten(model.annotation, expected_track_length=len(model)))
alt.Chart(df).mark_boxplot(extent=1.5).encode(
x='selection:O',
y='distance_to_interface:Q',
)
df
|     | position | plddt | access | selection | hotspots | clusters | distance_to_fe | distance_to_interface | interface_prediction | conserved |
|-----|----------|-------|--------|-----------|----------|----------|----------------|-----------------------|----------------------|-----------|
| 0 | 1 | 42.63 | 1.000000 | 0 | 0 | 0 | 66.680168 | 68.355728 | 81.53 | 1.0000 |
| 1 | 2 | 44.54 | 0.995968 | 0 | 0 | 0 | 64.177109 | 64.654755 | 82.79 | 1.0000 |
| 2 | 3 | 45.96 | 1.000000 | 0 | 0 | 0 | 62.222004 | 63.754223 | 87.52 | 1.0000 |
| 3 | 4 | 49.76 | 0.688679 | 0 | 0 | 0 | 60.561222 | 64.366493 | 88.10 | 1.0000 |
| 4 | 5 | 50.62 | 0.788732 | 0 | 0 | 0 | 57.877182 | 61.342251 | 89.19 | 0.9000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 693 | 694 | 89.39 | 0.183099 | 0 | 0 | 0 | 16.412100 | 27.741705 | 84.40 | 0.8053 |
| 694 | 695 | 90.23 | 0.142132 | 0 | 0 | 0 | 13.978737 | 26.945194 | 88.93 | 1.0000 |
| 695 | 696 | 77.40 | 0.229839 | 0 | 0 | 0 | 10.931679 | 24.255344 | 72.21 | 0.5579 |
| 696 | 697 | 64.85 | 0.379032 | 0 | 0 | 0 | 12.241253 | 26.450377 | 76.98 | 0.7316 |
| 697 | 698 | 45.53 | 0.889706 | 0 | 0 | 0 | 11.424846 | 28.426435 | 60.73 | 0.3000 |

698 rows × 10 columns
Once hypotheses have been generated, we need to test them. The typical regression framework assumes independent data points, which we clearly violate due to spatial dependency (two neighboring residues are more likely to share properties than would be expected by chance). Luckily, there is a spatial regression framework (pysal), which we can easily interface with.
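To make "spatial dependency" concrete, a classic single-number summary is Moran's I: values near zero indicate no spatial autocorrelation, positive values mean neighboring sites tend to be similar, which is exactly the situation with residue properties. A toy sketch (my own, not pysal's implementation):

```python
import numpy as np

def morans_i(x, w):
    # Cross-products of deviations between neighbors, normalised by variance
    z = x - x.mean()
    return (len(x) / w.sum()) * (z @ w @ z) / (z @ z)

# Six sites on a line with a clustered pattern; neighbors = adjacent sites
x = np.array([1., 1., 1., 0., 0., 0.])
w = np.array([[1.0 if abs(i - j) == 1 else 0.0 for j in range(6)]
              for i in range(6)])
i_obs = morans_i(x, w)  # -> 0.6, clearly positive
```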
Where utter patternlessness or randomness prevails, nothing is predictable. -- "Real Patterns", D. Dennett, 1991
First, we will test several variables against a binary dependent one, namely whether a residue is under positive selection or not. We could use the positions identified by Barber et al., but here we will use the spatial hotspots identified above. Alternatively, we could use single clusters for more granular analyses, for example if we suspect the hotspots emerge from multiple selective forces (cluster 1 from protein A, cluster 2 from protein B, etc.).
points = list(get_alpha_carbon_atoms(model, only_coords=True))
# Define the "neighborhood", either using k-nearest neighbors ...
w = KNN.from_array(points, k=8)
# ... or a Euclidean distance cutoff (this overrides the KNN weights above)
angstrom = 8
w = DistanceBand(points, angstrom, p=2, binary=True)
# Row-standardise the weights (each neighbor weighted 1/number of neighbors)
w.transform = 'r'
y = np.array(df['hotspots'])
columns = ['distance_to_interface', 'distance_to_fe', 'access']
x = np.array(df[columns])
m = Probit(y, x, w=w, name_y='selection', name_x=columns)
np.around(m.betas, decimals=6)
# constant, then independent variables in order of appearance
array([[-0.790137],
[-0.059364],
[-0.015249],
[ 1.159931]])
print(m.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: CLASSIC PROBIT ESTIMATOR
-------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : selection Number of Observations: 698
% correctly predicted: 94.99
Log-Likelihood : -120.0490
LR test : 37.6178
LR test (p-value) : 0.0000
------------------------------------------------------------------------------------
Variable Coefficient Std.Error z-Statistic Probability
------------------------------------------------------------------------------------
CONSTANT -0.7901370 0.3086193 -2.5602320 0.0104602
distance_to_interface -0.0593643 0.0125276 -4.7386875 0.0000022
distance_to_fe -0.0152488 0.0184589 -0.8260964 0.4087494
access 1.1599307 0.3803480 3.0496563 0.0022910
------------------------------------------------------------------------------------
MARGINAL EFFECTS
Method: Mean of individual marginal effects
------------------------------------------------------------------------------------
Variable Slope Std.Error z-Statistic Probability
------------------------------------------------------------------------------------
access 0.1044783 0.0352032 2.9678680 0.0029987
distance_to_fe -0.0013735 0.0016062 -0.8551417 0.3924727
distance_to_interface -0.0053471 0.0009770 -5.4732144 0.0000000
DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST MI/DF VALUE PROB
Kelejian-Prucha (error) 1 22.273 0.0000
Pinkse (error) 1 877.230 0.0000
Pinkse-Slade (error) 1 377.509 0.0000
================================ END OF REPORT =====================================
There is a pretty significant association between our positively selected sites and both solvent access and distance to the TbpA binding interface. As a sort of negative control, the distance to the Fe-binding site shows no significant association, matching the visual impression from above.
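As an aside, the "Mean of individual marginal effects" reported in the probit summary can be reproduced conceptually: for observation i and variable k, the marginal effect is the standard normal density at the linear predictor times the coefficient, averaged over observations. A sketch with made-up data and coefficients loosely modeled on the fit above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# 100 made-up observations: a constant column plus two regressors
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
beta = np.array([-0.8, -0.06, 1.2])

# Marginal effect of variable k for observation i: pdf(x_i @ beta) * beta_k
effects = norm.pdf(X @ beta)[:, None] * beta[None, :]
mean_effects = effects.mean(axis=0)  # same sign as the coefficients
```

Because the normal density is always positive, each mean marginal effect has the same sign as its coefficient, which is why the slope table above mirrors the coefficient signs.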
So let's try a regression where the dependent variable is continuous:
y = np.array(df['conserved'])
x = np.array(df[columns])
m = OLS(y, x, w=w, name_y='selection', name_x=columns)
print(m.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : selection Number of Observations: 698
Mean dependent var : 0.9117 Number of Variables : 4
S.D. dependent var : 0.1863 Degrees of Freedom : 694
R-squared : 0.1435
Adjusted R-squared : 0.1398
Sum squared residual: 20.725 F-statistic : 38.7509
Sigma-square : 0.030 Prob(F-statistic) : 3.678e-23
S.E. of regression : 0.173 Log likelihood : 236.970
Sigma-square ML : 0.030 Akaike info criterion : -465.940
S.E of regression ML: 0.1723 Schwarz criterion : -447.747
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
CONSTANT 0.9027318 0.0157495 57.3181461 0.0000000
distance_to_interface 0.0030399 0.0006872 4.4237379 0.0000113
distance_to_fe 0.0014898 0.0010739 1.3873124 0.1657919
access -0.2804807 0.0289132 -9.7007881 0.0000000
------------------------------------------------------------------------------------
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 6.453
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 487.265 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 3 143.576 0.0000
Koenker-Bassett test 3 65.069 0.0000
================================ END OF REPORT =====================================
Very conserved residues sit deeper in the protein (less solvent access), as one would expect for structurally constrained positions.